FOREWORD


This technical report is the First Annual Report by the
University of Texas at Austin, Linguistics Research Center,
Austin, Texas, under contract F30602-70-C-0118, Job Order
Number 45940000, for Rome Air Development Center, Griffiss
Air Force Base, New York. It covers the period from 1
February 1970 to 31 January 1971. Sgt. Charles S. Bond, Jr.
(IRDT) is the RADC Project Engineer.

This report has been reviewed by the Information
Office (OI) and is releasable to the National Technical
Information Service (NTIS).

This  technical  report  has  been  reviewed  and  is  approved. 


ABSTRACT 


Research in theoretical linguistics, descriptive linguistics,
lexicography, and systems design pertinent to the Linguistics
Research System for mechanical translation performed at the
Linguistics Research Center is described. Work in the
theoretical group concentrated on intra-sentential
disambiguation and on improving certain parts of the system to
achieve greater economy in processing. The linguistic group was
engaged in correcting and updating the existing German and
English lexical data bases by assigning syntactic and semantic
selection restrictions to lexical items. Work in the systems
group concentrated on the reduction of the size of the existing
LRS lexical data base without information loss, on the
conversion of this data base to the LRS subscript format, on
the construction of supporting programs to expedite and
facilitate the updating of the LRS word lists, and on the
construction of part of the LRS grammar maintenance and systems
programs.
INTRODUCTION


The  difficulties  that  confront  attempts  to  mechanically 
recognize  and  produce  sentences  in  natural  language  generally 
arise from two causes. One is the lack of a lexicon with precise
information on the syntactic and semantic properties of the
context  in  which  these  lexical  items  may  occur.  The  other  source 
of  difficulty  is  the  concomitant  generality  of  the  recognition 
grammars  which  is  necessary  in  order  to  keep  the  number  of  re¬ 
quired  rules  to  a  manageable  size.  As  a  result  of  this  gener¬ 
ality,  sentences  are  assigned  a  vast  number  of  readings  ("forced 
ambiguities")  in  addition  to  their  legitimate  readings. 

These  difficulties  did  not  change  with  the  advent  of  trans¬ 
formational  recognizers  l7«  S»  10,  1/,.]  in  1964.  Due  to  the  lack 
of comprehensive grammars and a complete set of transformational
rules, these recognizers cannot be used for the analysis of
arbitrary sentences in natural language. (Cf. [8].) It may also be
significant that the advances in the theory of transformational
grammar (the incorporation of a lexical component with
semo-syntactic and semantic features, the introduction of output
constraints, derivational constraints, and transderivational
constraints) have not been incorporated into transformational
recognition procedures.

The dissatisfaction with transformational grammar as described
in Aspects of the Theory of Syntax [2] has led in the
meantime to a schism among generative linguists, with the
universal base hypothesis opposing an "extended" standard
transformational grammar. Moreover, the general disaffection for
the concept of a grammar as a device which generates individual
sentences can be observed from various attempts to tackle the
problem of producing or recognizing sentences in discourse by
positing so-called text grammars.

We feel that the difficulties in the production of
transformational grammars for language are mainly due to the
unnecessary complexity of the transformational apparatus. A
transformational grammar was, originally, a device which generates
all and only the grammatical sentences of a (surface) language.
The grammar supposedly generated deep structures from which,
by means of transformations, well-formed surface strings were
derived.

The advent of Aspects with its lexical component and embedding
of sentences increased the power of the phrase-structure
component; it was now able to generate well-formed and ill-formed
sentences. The transformational component obtained an
additional function, the "filtering function," whose purpose was
to delete all output strings which were not well-formed. These
could be recognized from the occurrence of "non-surface"
terminals: dummy symbols and internal sentence boundary marks.
That this filtering function did not suffice to eliminate all
ill-formed strings has been shown. And, so far, the additional
conditions stated above have had to be introduced in order to
guarantee the well-formedness of the output string.

With  this  in  mind  the  question  naturally  arises—  Why  not 
guarantee  the  well-formedness  of  the  output  string  by  means  of 
an output grammar? It is certainly interesting, if not
significant, that the centers which have made the most important
advances in the analysis of sentences in natural language (the
Transformation  and  Discourse  Analysis  Project  at  the  University 
of  Pennsylvania,  and  the  CETA  group  in  Grenoble)  operate  with  a 
transformational apparatus but with a surface grammar. The
addition  of  a  surface  grammar  component  has  an  obvious  advantage 
in  that  the  linguist  is  able  to  describe  the  strings  of  language 
in  a  manner  which  has  been  long  familiar  to  him  and  which  lin¬ 
guistic  tradition  has  used  for  centuries.  The  transformational 
component  can  then  be  considerably  simplified.  In  particular, 
the  ordering  of  transformations  which  had  originally  been 
necessary  to  guarantee  well-formedness  of  the  output  string  can 
now  be  taken  over  by  the  surface  grammar. 

In  conclusion  —  past  experience  has  clearly  demonstrated 
that,  due  to  the  large  number  of  rules  required,  surface  analysis 
by  means  of  context-free  grammars  with  simple  symbols  cannot  be 
performed. Further, a grammar appropriate for surface analysis
must permit the linguist to express the linguistic
generalizations that he has been accustomed to make: that a sequence
of constituents forms a constituent only if each constituent has
the  syntactic  and  semantic  properties  required  for  the  well- 
formedness  of  the  string. 


In the remainder of this report we give a general outline
of the Linguistics Research System (LRS), a grammatical model
for the mechanical recognition and production of sentences in
natural language used for machine translation purposes. A more
comprehensive description is given in [5].

During this contract period, the theoretical group at LRC
concentrated on disambiguation of sentences and on improving
certain parts of the system to achieve greater economy in processing.
Detailed descriptions can be found in [6]. The linguistic group
was engaged in correcting and updating the existing German and
English lexical data bases by assigning syntactic and semantic
selection restrictions to lexical items. The systems group was
concerned with a) reducing the size of existing LRS dictionaries
without loss of information while converting them to the new
LRS format; b) constructing supporting programs to expedite
and facilitate the updating of the LRS lexical data bases; and
c) constructing a part of the LRS grammar maintenance and LRS
systems programs. Detailed descriptions of these activities
can be found in Sections II, III, and IV.


SECTION  I 


THEORY 

THE  LINGUISTICS  RESEARCH  SYSTEM 


The purpose of the Linguistics Research System (LRS), which
is being constructed under this contract at the Linguistics
Research Center of the University of Texas at Austin, is to provide
a description and explanation of human linguistic capabilities by
performing recognition and production of sentences in natural
language, in order to achieve mechanical translation. The LRS is
a system of components which can be connected like building
blocks to form larger configurations. Each component consists of
a set of algorithms and instructions which are executed by the
algorithms and which modify the general operations of the
algorithms in a prescribed way. Such instructions are linguistic
rules of various kinds: dictionary rules, syntactic rules,
interpretation rules, transformation rules, mapping rules,
selection rules, rejection rules, and others.

The  LRS  is  based  on  the  following  linguistic  assumptions: 

1)  that  grammatical  relations  can  be  more  easily  and 
correctly  stated  for  so-called  standard  strings  than  for 
surface  strings; 

2) that surface information is necessary for correct
semantic interpretation;

3) that synonymous sentences can be reduced to the
same "universal" representation.

In its basic configuration the LRS is a grammatical model
for the recognition and production of synonymous surface
sentences with identical or different deep structures. By deep
structures we mean the stage of a sentence derivation in standard
transformational grammar when all base component rules,
constituent and feature rewriting rules, have applied but before
lexical insertions have been performed.


1.1 Canonical Forms

The purpose of this model is to associate with each sentence
in a natural language all its semantic readings or canonical
forms (KF), and to derive from a given KF t all sentences with
the semantic reading t. A sentence which has n distinct semantic
readings has n distinct KF's. Two different sentences t and u
which have one semantic reading in common have one KF in common.
Sentences of different languages which are translations of one
another have at least one KF in common.


A canonical form consists of a sequence of connected simple
KF expressions. K, the language of KF's, has the following
properties:

a)  Each  simple  KF  expression  is  a  primitive  element  of 
K  (i.e.,  it  has  one  and  only  one  [atomic]  semantic  interpre¬ 
tation).  If  a  surface  terminal  q  has  n  different  senses, 
then  n  different  KF  expressions  (simple  or  connected)  repre¬ 
sent  the  different  senses  of  q. 

b)  No  two  different  KF  expressions  p  and  q  are  syno¬ 
nymous.  If  two  surface  terminals  have  one  sense  in  common, 
then  that  reading  is  represented  by  the  same  KF  expression. 


1.2 Normal Forms

Because  of  the  difficulties  involved  in  the  construction  of 
KF's,  LRS  represents  the  meaning  of  sentences  by  means  of  normal 
forms  (NF) . 

The NF's of a language are distinct from the KF's in that
NF lexical primitives may represent either atomic (simple) or
molecular (connected) KF expressions. Thus the NF primitive
bachelor corresponds to the connection of the four simple KF
expressions unmarried • human • adult • male.
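
The relation between NF primitives and KF expressions can be illustrated with a small sketch. It is only an illustration under stated assumptions: the dictionary `NF_LEXICON`, the function names, and the representation of KF expressions as plain strings are hypothetical and are not part of LRS.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical NF lexicon: each NF primitive maps to the simple KF
# expressions it stands for (one if atomic, several if molecular).
NF_LEXICON: Dict[str, FrozenSet[str]] = {
    "bachelor": frozenset({"unmarried", "human", "adult", "male"}),
    "human": frozenset({"human"}),
}


def expand(nf_primitive: str) -> FrozenSet[str]:
    """Return the simple KF expressions behind an NF primitive."""
    return NF_LEXICON[nf_primitive]


def is_molecular(nf_primitive: str) -> bool:
    """True if the NF primitive corresponds to connected KF expressions."""
    return len(expand(nf_primitive)) > 1


def share_reading(readings_a: Set[str], readings_b: Set[str]) -> bool:
    """Two sentences share a semantic reading (or are translations of one
    another) if their sets of canonical forms have a member in common."""
    return bool(readings_a & readings_b)
```

Representing readings as sets makes the condition of section 1.1 (one semantic reading in common means one KF in common) a plain set intersection.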


1.3 Mechanical Translation


The  process  of  deriving  from  a  surface  sentence  t  all  the 
NF's  of  t  is  performed  by  the  following  components: 

the  surface  component 

the  standard  component 

the  normal  form  component. 

The  surface  component  assigns  to  each  surface  sentence  t  all 
its  syntactic  readings  according  to  the  surface  grammar,  and  de¬ 
rives  from  those  a  tentative  standard  string  by  means  of  the 
transformation  instructions  contained  in  the  rules  which  apply 
to t. Tentative standard strings consist of complex standard
terminal  symbols.  These  are  surface  terminals  with  their  (possi¬ 
bly  disambiguated)  dictionary  interpretation,  and  dummy  symbols 
which  were  introduced  by  the  transformations.  Dummy  symbols  re¬ 
present  grammatical  morphemes  and  elided  lexical  items.  Elements 
which  were  discontinuous  in  the  surface  are  contiguous  in  the 
tentative standard strings.

The  standard  component  then  analyzes  these  strings  with  the 
standard  grammar  which  assigns  a  standard  description  to  all  well- 
formed  strings,  and  filters  out  all  ill-formed  strings. 

The  NF  component  finally  interprets  the  readings  of  the  re¬ 
maining  standard  strings  by  means  of  the  NF  grammar  which  assigns 
NF expressions to individual or connected standard subtrees.

Production,  the  reversal  of  the  recognition  process,  is  also 
performed  in  three  steps. 

a)  The  normal  form  component—  by  means  of  the  NF  gram¬ 
mar  of  the  output  language—  derives  from  the  NF  reading  of 
the  input  sentence,  which  is  identical  to  the  NF  reading  of 
the  output  language,  all  the  associated  tentative  standard 
readings  of  the  output  string  t. 

b)  The  standard  component—  by  means  of  the  conditions 
and operations stated in the standard grammar rules of the
output  language—  selects  all  well-formed  standard  readings 
from  the  tentative  standard  reading,  and  filters  out  all  ill- 
formed  readings.  The  standard  component  then  associates 
with  each  standard  reading  the  corresponding  tentative  sur¬ 
face  strings. 

c)  The  surface  component—  by  means  of  the  rearrange¬ 
ment  grammar  of  the  output  language—  then  assigns  a  surface 
description to all well-formed surface strings and filters
out  all  ill-formed  surface  strings,  i.e.,  those  which  are 
either  not  accepted  or  do  not  meet  the  output  conditions  of 
the  rearrangement  grammar.  The  transformation  instructions 
associated  with  the  rearrangement  rules  finally  delete  the 
standard  dummy  symbols,  reintroduce  lexical  pieces  which 
had  been  deleted  after  surface  analysis,  and  rearrange  the 
remaining  terminals  in  surface  word  order. 


1.4 Subscript Grammars

Four grammars (surface, standard, normal form, and
rearrangement) exist for each language. The non-terminal and
terminal vocabulary symbols of each grammar are complex symbols,
except for the terminal symbols of the surface grammar. Each
complex symbol consists of a category symbol and zero or more
subscript or feature symbols; each subscript may have zero or more
values.

The grammar rules used during the recognition and production
of sentences (both of which are performed as a bottom-to-top
direct substitution analysis) are generated by the processing
algorithms by means of instructions represented as context-free
rule schemata. A rule schema successfully analyzes a string of
vocabulary symbols if each rule constituent is non-distinct from
the symbol it analyzes, and if all the relations stated between
rule constituents in the rule schema hold for the corresponding
analyzed symbols.

If a rule schema is successfully applied, a new vocabulary symbol
is constructed according to the instructions stated in the
antecedent of the rule schema.

The  conditions  that  may  be  stated  for  individual  constituents 
in  a  rule  consequent  are: 

a)  A  particular  category  symbol  either  may  not  or  must 
contain  a  particular  subscript  or  combination  of  subscripts; 

b)  A  particular  subscript  symbol  may  not  or  must  con¬ 
tain  a  particular  value  or  combination  of  values; 

c) Operations between subscripts of different
constituents may not or must be successful. (These operations,
the set-theoretical operations Intersection, Sum, and
Difference, are performed with the values of the specified
subscripts.)
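
A complex symbol and the three set-theoretical operations can be sketched as follows. This is a minimal illustration, not the LRS implementation: the class `ComplexSymbol`, the helper names, and the sample rule `build_np` (a determiner and a noun that must agree in number) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class ComplexSymbol:
    """A category symbol plus zero or more subscripts, each with a value set."""
    category: str
    subscripts: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def values(self, name: str) -> FrozenSet[str]:
        # A missing subscript is treated as having no values.
        return self.subscripts.get(name, frozenset())


def intersection(a: ComplexSymbol, b: ComplexSymbol, name: str) -> FrozenSet[str]:
    return a.values(name) & b.values(name)


def set_sum(a: ComplexSymbol, b: ComplexSymbol, name: str) -> FrozenSet[str]:
    return a.values(name) | b.values(name)


def difference(a: ComplexSymbol, b: ComplexSymbol, name: str) -> FrozenSet[str]:
    return a.values(name) - b.values(name)


def build_np(det: ComplexSymbol, noun: ComplexSymbol) -> Optional[ComplexSymbol]:
    """Hypothetical rule: DET + N form an NP only if the intersection of
    their NU (number) values is non-empty, i.e. the operation 'succeeds'."""
    if det.category != "DET" or noun.category != "N":
        return None
    agreed = intersection(det, noun, "NU")
    if not agreed:
        return None  # the operation failed, so the rule does not apply
    return ComplexSymbol("NP", {"NU": agreed, "PS": frozenset({"3"})})
```

In this sketch an operation is "successful" when its result is non-empty, which is one plausible reading of condition c) above.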

The advantages of a subscript grammar are numerous. It
permits the expression of relations such as agreement and government
which correspond to the intuition of the human speaker.
Similarly, grammatical, semantic, and stylistic categories can be
conveniently expressed.


1.5 Syntactic Grammars

Each rule schema of each grammar consists of a syntactic
part and an optional transformational part. For surface and
standard grammar, the syntactic part of each rule schema consists
of context-free rewrite rules. The transformational part contains
only transformations whose structural description is satisfied by
a string of symbols interpreted by the constituents of the rule
schema consequent. The transformations possible in surface and
standard grammar are permutations, deletions, and insertions.

The transformations are "feature sensitive"; in particular, it is
possible to lexicalize features of a constituent and to
"featurize" terminal or non-terminal constituents. Thus, words like up
which form a lexical unit with some verbs, e.g., look something
up, can be assigned as a feature to the head of the verbal
construction, resulting in

    look    something
    +up

1.6  Normal  Form  Grammar 

The rules of the NF grammar differ from surface and standard
rules in two respects:

a) They apply to connected trees;

b) They are not rewrite rules.

An NF rule applies to all trees (terminal, non-terminal, or
combinations of them) whose nodes, labeled by complex symbols, are
non-distinct from the complex symbols in the consequent of the
NF rule. The antecedent of the NF rule assigns a particular
semantic reading, an NF expression represented by that
antecedent, to all trees to which it applies. Since NF expressions
apply to trees whose nodes are labeled by complex symbols, it is
possible to assign a particular NF reading to a terminal k with
a particular part-of-speech interpretation and with a particular
selection restriction. At the same time, all trees t1, t2, ..., tn
interpreted by the same NF expression are substitutable for one
another, regardless of whether the root and end nodes of tree ti
are identical to or different from those of tree tj.

It  is  thus  possible  to  define  synonymy  relations  between 
words  of  different  part-of-speech  and  between  different  syntactic 
structures  and  terminal  structures  (e.g.,  lexical  units  and  idio¬ 
matic  expressions;  lexical  units  and  phrasal  expressions;  and, 
lexical  units  which  have  an  internal  variable  slot),  without  af¬ 
fecting  their  transformational  possibilities.  Examples  of  such 
paraphrases  can  be  found  in  [5],  pp.  T 2 1 7 ~ 6 8 . 


1.7 Analysis Procedure

The recognition and production of strings is performed as a
bottom-to-top analysis. We believe that analysis procedures like
those of Earley [3] or those based on state-transition diagrams
[1, 9, 11] do not operate as efficiently with LRS grammars due to
the complexity of their symbols and the large number of
permutations of constituents typical of highly inflected languages such
as German.

We selected bottom-to-top analysis for the reasons which
follow.

a)  It  permits  an  easier  treatment  of  ill-formed  strings 
(k-strings)  within  well-formed  strings  which  occur  frequent¬ 
ly  in  translations,  e.g.,  formulas,  foreign  names,  foreign 
citations, etc.

b) It permits the adding of new syntactic or
semo-syntactic values to the lexicon without a concurrent change of
the non-terminal grammar rules. Assume, for example, that
one discovers a sub-class of adjectives which modify only a
certain type of nouns. The addition of the new semantic
feature under the subscript "type" only requires changing
the dictionary rules for the nouns and adjectives affected.
None of the word formation rules or syntactic rules will
need to be changed. This advantage would be lost in a
top-to-bottom analysis where, in addition to the dictionary
rules, the tables for the subscript "type" for nouns and for
adjectives would have to be changed.

c) Finally, tree structures which interpret ambiguous
strings can be conflated to a single tree structure if all
labels of the tree nodes have the same category symbol.
The number of intermediate analyses, similar to state
transition diagrams, is thus considerably reduced. A similar
conflation occurs in the representation of the normal forms
of sentences which contain semantically ambiguous items.

The  economy  of  this  analysis  procedure  was  further  increased 
by  the  introduction  of: 

a) left-context-sensitive dictionary analysis (cf.
3.4.1);

b)  intermediate  choice  algorithms  which — based  on 
well-formedness  conditions —  destroy  all  inappropriate 
readings  after  dictionary  and  word  analysis;  and, 

c)  context-sensitive  rejection  rules,  which  apply 
during  word  analysis  and  whose  instructions  are  executed 
during  word  choice.  Word  Choice  tags  all  those  nodes  on 
which  no  syntactic  rule  may  build  within  the  analyzed  text. 


1.8 Intra-Sentential Disambiguation

The most powerful feature is the system's capability of
performing semantic disambiguation of lexical items in context after
sentence  analysis  without  having  to  trace  down  the  tree  branches 
from  the  node  S.  This  is  made  possible  by  means  of  trace  opera¬ 
tors  which  are  associated  with  the  disambiguating  values  of  am¬ 
biguous  lexical  items.  These  operators  cause  the  system  to  re¬ 
member  the  location  of  these  lexical  items  and  to  disambiguate 
them  only  if  a  disambiguating  environment  is  given. 


1.9 Changes in Subscript Grammar Format and Storage


Certain modifications in the format of writing and storing
subscript rules were made during the reporting year. The most
significant are:

1) the now-permissible separation of condition and
operation statements in subscript rules, and

2)  the  method  of  storing  the  grammar  for  actual 
analysis. 
1.9.1 New Format for Operation Statements

The encoding scheme below was introduced in order to
eliminate the ambiguity resulting from two or more linked operations.


For example, consider the rule:

         ①           ②            ③           ④
  C 5    V NP    =   V DET        V A         V N
                     - 3.1GD      . 4.1GD     $ GD

(The encircled digits identify the rule fields.) Under the old
convention it is not obvious whether the operation in field 2
(i.e., -3.1GD) means: perform the difference operation between

a) the value set of the workspace subscript GD for DET and
the value set of the workspace subscript GD for A,
or

b) the value set of the workspace subscript GD for DET and
the value set resulting from the intersection indicated
at 3.1.

In the first, a), the operations at 2.1 and 3.1 are disjoint
and can be done in any order. In the second, b), the operation
stated in 3.1 must be done first.


The operation statement for a subscript may now be separated
from its condition statement. Rule 12, which was coded as

  C 12   V NO   =   V A          V N
                    . 3.1GD      $ GD
or
  C 12   V NO   =   V A          V N
                    $ GD         . 2.1GD

may now be encoded as

  C 12   V NO   =   V A          V N
                    $ GD         $ GD
                    . 2.1,3.1

The statement ". 2.1,3.1" represents "perform intersection
between the value sets of the subscript names 2.1
and 3.1", i.e., GD of A and N.

Since the system treats a separated operation statement as
if it were also a subscript, sequences of linked operations can
be stated in a straightforward manner. Thus, for example,
reading b) in Rule 5, above, can now be represented as

  C 5    V NP   =   V DET        V A          V N
                    $ GD         $ GD         $ GD
                    - 2.1,3.2    . 3.1,4.1

whereas reading a) is represented as

  C 5    V NP   =   V DET        V A          V N
                    $ GD         $ GD         $ GD
                    - 2.1,3.1    . 3.1,4.1

No condition is imposed on the position where separated
operation statements may occur. It is thus possible to place them
in the most advantageous position from a processing point of
view, i.e., the left-most constituent in a rule consequent, as
for version b) above:

  C 5    V NP   =   V DET        V A          V N
                    $ GD         $ GD         $ GD
                    . 3.1,4.1
                    - 2.1,2.2
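
The difference between readings a) and b) of Rule 5 can be made concrete with ordinary set operations. The sketch below is illustrative only: the function names and the sample GD value sets are hypothetical, and the report does not say what the values of GD are.

```python
from typing import FrozenSet, Tuple

Values = FrozenSet[str]


def reading_a(det_gd: Values, a_gd: Values, n_gd: Values) -> Tuple[Values, Values]:
    """Reading a): the two operations are disjoint.
    Field 2: GD(DET) minus GD(A).  Field 3: GD(A) intersected with GD(N).
    The order of evaluation does not matter."""
    field2 = det_gd - a_gd
    field3 = a_gd & n_gd
    return field2, field3


def reading_b(det_gd: Values, a_gd: Values, n_gd: Values) -> Tuple[Values, Values]:
    """Reading b): the operations are linked.
    The intersection in field 3 must be computed first; its result is then
    subtracted from GD(DET) in field 2."""
    field3 = a_gd & n_gd
    field2 = det_gd - field3
    return field2, field3


# Hypothetical value sets, chosen only so that the two readings differ.
DET_GD = frozenset({"x", "y", "z"})
A_GD = frozenset({"x", "y"})
N_GD = frozenset({"x"})
```

With these sample sets, reading a) leaves only `z` in field 2, while reading b) leaves `y` and `z`, which is why the old notation needed to be disambiguated.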


1.9.2 Storage of Analysis Grammars

The  manner  in  which  the  word  and  syntax  grammars  are  stored 
has  a  great  influence  on  the  speed  of  analysis.  After  investi¬ 
gating  how  the  word  and  syntax  algorithms  would  operate,  a 
storage  structure  using  a  reverse  columnar  approach  was  chosen. 
The  grammar  in  question  is  stored  by  columns,  the  first  column 
containing  all  the  unique  last  terms  of  rule  consequents.  The 
succeeding columns contain the penultimate terms, the
antepenultimate terms, etc. Associated with each term is a list of rule
numbers  in  which  it  occurs.  Each  terminal,  i.e.,  the  left-most 
term  of  a  rule  consequent,  is  marked  and  has  a  pointer  to  its 
antecedent  term. 

The analysis programs construct the actual rule by means of
the analyzed individual terms and their associated rule numbers.
Since each unique nth rule term is stored only once, the method
of storing the grammar as described above should facilitate the
analysis as well as use a minimum of storage. As the grammar
might exceed available core memory space, the storage method also
ensures that most or all of the grammar that the analysis program
needs at one time is kept in memory. If last terms are being
analyzed, for instance, all the last terms can be in memory; if
penultimate terms are being analyzed, all the penultimate terms
can be in memory, etc. We anticipate that this method of
storing the grammar will result in a considerable increase in
processing speed.
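
The reverse columnar arrangement can be sketched as follows. This is a simplified illustration, not the LRS storage format: terms are plain category strings rather than complex symbols, rule numbers are assumed to be unique, and the names `build_columns` and `match_suffix` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Rule = Tuple[int, str, Sequence[str]]  # (rule number, antecedent, consequent terms)


def build_columns(rules: Sequence[Rule]):
    """Store the grammar by columns, counted from the right end of each
    consequent: column 0 holds the unique last terms, column 1 the
    penultimate terms, and so on. Each term keeps the rule numbers in
    which it occurs at that position."""
    columns: List[Dict[str, Set[int]]] = []
    antecedents: Dict[int, str] = {}
    lengths: Dict[int, int] = {}
    for number, antecedent, consequent in rules:
        antecedents[number] = antecedent
        lengths[number] = len(consequent)
        for depth, term in enumerate(reversed(consequent)):
            while len(columns) <= depth:
                columns.append({})
            columns[depth].setdefault(term, set()).add(number)
    return columns, antecedents, lengths


def match_suffix(columns, antecedents, lengths, symbols: Sequence[str]):
    """Return (rule number, antecedent) for every rule whose consequent
    matches the right end of `symbols`, working right to left and
    intersecting the rule-number lists column by column."""
    results = []
    candidates = None
    for depth, symbol in enumerate(reversed(symbols)):
        if depth >= len(columns):
            break
        here = columns[depth].get(symbol, set())
        candidates = set(here) if candidates is None else candidates & here
        if not candidates:
            break
        for number in sorted(candidates):
            # The left-most term of the consequent has been reached.
            if lengths[number] == depth + 1:
                results.append((number, antecedents[number]))
    return results
```

Because only one column is consulted at each step, an implementation along these lines could keep just that column in memory at a time, which corresponds to the motivation given above.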


Example of Intra-Sentential Disambiguation


The capabilities of LRS for performing intra-sentence
disambiguation may be shown by the analysis and standardization of
the English sentence

The page slept

In this sentence, the noun page is ambiguous; one of its semantic
readings is the reading BOY, another the reading PAGINA. This
ambiguity is represented in the dictionary rule 2 below, which
applies to the noun page, by the subscript TY (for type) with
the values HU and IN for HUman and INanimate. This ambiguity
is resolved in the context of the verb slept, which requires an
animate subject, indicated by the subscript TS with the value
AN in rule 6. During the analysis of this sentence, the rule
schemata apply in the order indicated by their numbers.

English  dictionary  and  grammar  rules: 


1  V DET            =  * THE
   + NU(S,P)

2  V N              =  * PAGE
   + TY(HU,IN)
   + CL(05)
   T 1.1

3  V N              =  V N
   $2.1(X+AN+PO)       $ TY(*AN+*PO+HU)
   A 2

4  V N              =  V N
   + NU(S)             $ CL(01,..05)
   A 2

5  V NP             =  V DET       V N
   + PS(3)             . 3.1NU     $ NU
   $*2.1
   A 3
   CHOICE
   S m   (m = 2-1)

6  V V              =  * SLEPT
   + CL(15)
   + OB(0)
   + TS(AN)

6  V V              =  * SLEEP
   + CL(07)
   + OB(0)
   + TS(AN)

7  V VP             =  V V
   + TN(PA)            $ CL(...15)
   + PS(1,2,3)
   + NU(S,P)

8  V S              =  V NP        V VP         D #      D AUX
   $3.3                . 3.1NU     $ NU                  $ 3.5
   $*2.3               . 3.2PS     $ PS                  $ 3.6
   CHOICE              . 3.4TY     $ OB(*0)              $*2.1
   A 1.1(3.3)                      $ TS                  $*2.2
   A 1.2(2.3,3.4)                  $ TN
   S n   (n = 2-4-1-3-5)           $ VC(A)

Rule  1  assigns  the  word  the.  the  interpretation  DETerminer  and 
states  that  its  NUmber  is  Singular  or  Plural.  ' 

Rule  2  assigns  the  word  page  the  interpretation  Noun  of  the 
paradigmatic  CLass  05  and  the  values  HUman  and  INanimate  of  the 
subscript  TY .  The  subscript  T  1.1  indicates  that  the  values  in 
the  first  subscript  of  the  first  rule  term,  in  this  case  of  TY, 
represent  semantic  ambiguity.  The  effect  of  the  T  operator  is 
that  the  address  of  this  subscript,  given  in  brackets  in  the  tree 
diagram  below,  is  associated  with  the  subscript  TY. 

Rule  3  is  a  redundancy  rule  which  states  that  all  nouns  with  the 
value  HUman  which  have  neither  the  value  ANimate  nor  the  value 
Physical  Object  add  the  va 1 uer  ANimate  and  Physical  Object. 

The  expression  A  2  in  the  antecedent  is  an  instruction  to  the 
algorithm  to  carry  along  all  the  subscripts  of  the  second  con¬ 
stituent  not  mentioned  in  the  second  rule  term. 

Rule  4  states  that  nouns  of  particular  paradigmatic  CLasses,  if 
followed  by  zero  ending,  become  rouns  with  the  NUmber  Singular. 
Again,  A  2  results  in  the  carrying  along  of  the  non-ment 1 oned 
subscr i pts . 

Rule  5  states  that  the  sequence  of  DETerminer  and  Noun  results 
in  a  Noun  Phrase,  provided  that  the  DETerminer  and  the  Noun 
agree in NUmber. The instruction . 3.1NU is to be read as
"intersect the values of the subscript NU with the values of the
first subscript of the constituent matched by the third rule
term." The Noun Phrase is assigned the feature "third person"
and the NUmber in which the two terms agree; the non-mentioned
subscripts of term 3 are carried along.
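
The agreement check in Rule 5 can be pictured as a set intersection. The following is only a minimal sketch, assuming that subscripts are modelled as Python sets inside a dictionary; the function names and the dictionary layout are illustrative and are not part of the LRS rule format.

```python
def intersect_values(left, right):
    """Return the values shared by two subscripts (empty set if none)."""
    return set(left) & set(right)


def build_noun_phrase(det, noun):
    """Combine a DET and an N into an NP if their NU values intersect."""
    number = intersect_values(det["NU"], noun["NU"])
    if not number:
        return None  # no agreement in NUmber: the rule does not apply
    # Carry along the noun's subscripts that the rule does not mention.
    carried = {k: v for k, v in noun.items() if k not in ("cat", "NU")}
    return {"cat": "NP", "PS": {"3"}, "NU": number, **carried}


# Hypothetical encodings of "the" (Rule 1) and "page" after Rule 4.
the = {"cat": "DET", "NU": {"S", "P"}}
page = {"cat": "N", "NU": {"S"}, "CL": {"05"}, "TY": {"HU", "IN"}}
noun_phrase = build_noun_phrase(the, page)
```

In this sketch the resulting phrase keeps the noun's CL and TY subscripts, which mirrors the carrying along of non-mentioned subscripts described above.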

Rule 6 assigns the word slept the reading Verb of paradigmatic
CLass 15. OB(0) stands for "requires zero object," TS(AN)
stands for "the subject must be animate." As we see in the
next rule, allomorphs of a morpheme are assigned the same rule
number. They have in common all subscripts except for the sub-
script which indicates paradigmatic CLass.

Rule 7 rewrites all Verbs of CLass 15 as full VerBs in the
PAst TeNse, in the first, second or third PerSon and in the
NUmber Singular or Plural. A 2 results in the carrying along
of all features of the underlying verb.

The syntactic part of rule 8 consists of the first three terms,
which rewrite a Noun Phrase followed by a Verb Phrase as a
Sentence provided the Noun Phrase and the Verb Phrase agree
in NUmber and PerSon and provided that the TYpe of the Noun
Phrase has a value in common with the subscript TS of the Verb
Phrase. In addition, the verb phrase must dominate an intransitive
verb (objects of transitive verbs are dominated by S,
not by VP). These subscripts are artificially associated with
S to permit an easier execution of the rule's choice statement.
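
The interaction between the redundancy rule (Rule 3) and the TY/TS condition of rule 8 can be sketched as follows. This is a simplified illustration, not the LRS address and weighting mechanism: it assumes that each value of TY on the noun is a separate reading, and it uses the abbreviation PO for Physical Object, which is an assumption made here for the example.

```python
# Redundancy rule 3 as described: HU implies ANimate and Physical Object.
# "PO" is a hypothetical abbreviation chosen for this sketch.
REDUNDANT_VALUES = {"HU": {"AN", "PO"}}


def expand(reading):
    """Return a reading together with the values it implies."""
    return {reading} | REDUNDANT_VALUES.get(reading, set())


def compatible_readings(noun_readings, required_types):
    """Keep the readings that share a value with the verb's TS subscript."""
    return {r for r in noun_readings if expand(r) & set(required_types)}
```

For a noun with the readings HU and IN and a verb with TS(AN), only the HU reading survives in this sketch; if the verb imposed no such restriction, both readings would remain.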


The application of these rules to the input sentence results in
the following analysis:


8  S 
OB  (0) 

TY(AN) [8.1.1] 


Note that "space" occurs as the 4th and 9th text symbols.

After syntactic analysis, the choice statements in rule 8 are
executed. A 1.1(3.3) reads "take the value of the first sub-
script in field 1 and weight it in the address associated with
the third subscript in field 3 if there is such an address."

Thus only the instruction A 1.2(2.3) of A 1.2(2.3, 3.4) is execut-
ed. Syntactic choice also introduces the dummy terms of rule 8
and assigns the order 2-4-1-3-5 to the terms and dummy terms in
the rule consequent.

The standardization program then derives the following disambiguated
string, where the noun page no longer has the features
"human or inanimate" but only "human," as indicated by the
subscript and value TY(HU) below.


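[Diagram: standardized string for the input sentence, consisting
of the constituents C 1 (DET), C 2 (N) and C 6 (V) under # and
AUX, with the subscripts PS(3), TY(HU), NU(S,P), NU(S), CL(15),
CL(05), TN(PA), TO(0) and TS(AN).]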

The noun is assigned the interpretation BOY by the rule

V BOY = C 2
DO $ TY(HU)

which  we  can  represent  by  the  graph 


Had the input sentence been They saw the page, no disambiguation
would  have  been  possible.  In  that  case,  the  standard  represen¬ 
tation  of  the  noun  page  would  have  been 

C  2 
N 

TY(HU,IN)

CL (05) 

to  which  the  two  NF  rules  below  would  have  applied,  reflecting 
the  semantic  ambiguity. 

V BOY = C 2    and    V PAGZNA = C 2

DO  $  TY  (HU)  DO  $  TY  ( IN) 

The  two  resulting  German  translations  would  then  be: 

Sie sahen den Knaben and Sie sahen die Seite.

SECTION II

LEXICOGRAPHY


2.1 Existing Data

The German and English lexicographic data which were available
at the beginning of the reporting period included the German
monolingual machine-processable dictionary, two English
monolingual machine-processable dictionaries, a German verb list,
an English verb list, and German-English past participle and
noun lists.


2.1.1  The  German  Dictionary 

The German dictionary consists of approximately 40,000
entries. Since stem variants of nouns, adjectives, and verbs
constitute separate entries, these 40,000 dictionary entries
represent approximately 35,000 German word stems. Each entry
is classified as belonging to one of the following categories:
noun, adjective, verb, adverb, determiner, pronoun, preposition,
conjunction, or separable verbal prefix. In addition to these
categories, paradigmatic features are assigned to nouns,
adjectives, and verbs. Nouns have a feature, "gender," which
identifies them as masculine, feminine, neuter, or (in the case
of pluralia tantum nouns) plural. Adverbs which may be used to
modify nouns are marked with respect to their position relative
to the modified noun phrase: preposed (e.g., sogar die Roemer),
or postposed (e.g., dieser Satz hier).


2.1.2  German  Lexical  Lists 

In order to expand the LRC machine-processable dictionaries,
lists of German verbs and of past participles commonly
used as adjectives had previously been compiled, and compilation
of a list of German nouns had begun. All information was coded
from the German-English English-German Dictionary by Wildhagen
and Heraucourt since this dictionary contains a comparatively
large amount of explicit syntactic and semo-syntactic information.


2.1.2.1 The German Verb List

The German verb list originally contained approximately
18,400 entries. To this list, all stem variants of irregular
verbs were added automatically, resulting in a total of
approximately 30,000 entries. The entries contain a large amount of
syntactic but only a modicum of semo-syntactic information (for
approximately 12% of all verb entries).

a)  Prefixes  precede  the  verb  stem.  Separable  prefixes  are 
followed  by  a  blank  space,  inseparable  prefixes  by  a  hyphen  and 
a  blank  space.  Examples: 

AUF  STEH  (separable  prefix) 

VER-  ZWEIFEL  (inseparable  prefix) 

Note that the infinitive endings are stripped from the stem.

All stem variants of irregular verbs are entered in the
dictionary with identical semo-syntactic and syntactic information,
but with different paradigmatic information.

b) Transitivity. Each verb in the verb list is identified
by a descriptor as transitive (VT), intransitive (VI), or
reflexive (VR).

c) Case government is indicated for all transitive verbs
as genitive (GEN), dative (DAT), or accusative (VT or VR).
Descriptors indicating case government may also contain
information about the semantic type of the object:


JDN          =  human, accusative
JDM          =  human, dative
JDS          =  human, genitive
ETW          =  non-human, accusative
ETW DAT      =  non-human, dative
ETW GEN      =  non-human, genitive
VR DAT       =  reflexive, dative
[illegible]  =  reciprocal (einander)
ES           =  object must be es

d) Verbs which govern prepositional objects are marked by
the specific preposition(s) they may take. Those prepositions
which govern either dative or accusative are distinguished by
case descriptors:

AN ACC = an with accusative

AN DAT = an with dative


Prepositions  are  followed  by  descriptors  specifying  the  semantic 
type  of  object  required  (as  in  AUF  JDN),  or  by  SICH  (if  the 
prepositional  object  must  be  reflexive),  whenever  this  infor¬ 
mation  was  recognized  in  Wildhagen. 


e) The semantic type of subject required by a verb (if
indicated) is shown by (P) for human, (T) for animal, or (S) for
inanimate.

f) The auxiliary taken by a verb in perfect tense forms
is indicated as follows:

takes sein = S

takes either haben or sein = S H

takes haben = unmarked
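
The coding conventions a) to f) lend themselves to simple lookup tables. The sketch below is only an illustration, assuming that an entry key is a string such as AUF STEH or VER- ZWEIFEL; the table and function names are hypothetical, and the actual record layout of the verb list is not specified in this report.

```python
# Descriptor tables taken from conventions c), e) and f) above.
OBJECT_DESCRIPTORS = {
    "JDN": ("human", "accusative"),
    "JDM": ("human", "dative"),
    "JDS": ("human", "genitive"),
    "ETW": ("non-human", "accusative"),
    "ETW DAT": ("non-human", "dative"),
    "ETW GEN": ("non-human", "genitive"),
}
SUBJECT_TYPES = {"P": "human", "T": "animal", "S": "inanimate"}
# An unmarked entry (empty string here) takes haben.
AUXILIARY = {"S": ("sein",), "S H": ("haben", "sein"), "": ("haben",)}


def split_key(key):
    """Split an entry key into (prefix, separable, stem).

    Convention a): a separable prefix is followed by a blank,
    an inseparable prefix by a hyphen and a blank.
    """
    parts = key.split()
    if len(parts) == 1:
        return None, None, parts[0]
    prefix, stem = parts
    if prefix.endswith("-"):
        return prefix[:-1], False, stem
    return prefix, True, stem
```

Note that S appears both as a subject type, written (S), and as an auxiliary marker; the sketch keeps them in separate tables so the two uses cannot be confused.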


2.1.2.2 The German-English Noun List

Work on the compilation of a list of German nouns had been
in progress for some time. The information coded includes gender,
number (for pluralia and singularia tantum nouns), case
government (including prepositions) for deverbative nouns, and English
translation equivalents. Whenever information was given in
Wildhagen about the area of discourse to which a particular
translation is restricted, this information was coded with the
proper translations.


2.1.2.3 The German-English Past Participle List

The list of German past participles consists of approximately
1,100 entries. It contains primarily those past participles
which are frequently used as adjectives and whose meaning
and translation cannot be automatically derived from the
underlying verb stem, e.g., aufgebracht or interessiert, or the
English adjective excited. [Note the difference in meaning between
the past participial and the adjectival usage:

The electron was excited. (passive)

The man was excited. (active)]

Also included are past participles whose stem does not function
as a verb in modern German, as for example, bestuerzt (aghast),
or betagt (aged). The descriptors coded with these entries
indicate case government (including prepositions), semantic type
of object required, and English translation equivalents.


2.1.3  The  English  Dictionaries 

The English lexicographic data base existing at the beginning
of this reporting period consisted of two monolingual
machine-processable dictionaries: a) the so-called WEBSTER
dictionary, based on Webster's New Collegiate Dictionary, which
contains approximately 77,500 entries (47,300 nouns, 20,100
adjectives, 9,200 verbs, plus adverbs and function words), and
b) the so-called LRMD, which was derived from the Russian Master
Dictionary (RMD) and contains approximately 47,300 entries (34,000
nouns, 7,800 adjectives, 4,800 verbs, plus adverbs and function
words). Each word stem is assigned to one of the categories:
noun, adjective, verb, adverb, determiner, pronoun, preposition,
or conjunction. In addition, nouns, adjectives, and verbs are
assigned to paradigmatic classes and have a feature indicating
vocalic or consonantal onset. Nouns in the LRMD are also
subclassified as human or non-human. A small set of adjectives has
a  subscript  identifying  them  as  possible  post-nominal  modifiers, 
e  .  g .  ,  . 


2.1.4 The English Verb List

The English verb list was compiled from The Advanced Learner's
Dictionary of Current English by Hornby, Gatenby, and Wakefield,
and contained approximately 6,400 entries. The syntactic
information given for list entries included the permissible types
of complementation for each verb: objects (direct and indirect,
either of which may be in the form of a prepositional phrase),
predicative complements, adjectives, adverbials, infinitives
(unmarked and marked by to), present and past participles,
that-clauses, interrogative clauses, gerunds, and combinations of
these.


2.2 Progress

The lexicographic work done during the reporting period
consisted of the revision of the existing lexical lists, the addition
of translation equivalents, and the development of a general
system of syntactic and semo-syntactic features to be used in the
further subclassification of lexical elements.


2.2.1  Purpose 

The  purpose  of  the  lexicographic  work  performed  at  the  Cen¬ 
ter  is  manifold: 

a) to make the LRC machine-processable dictionaries as
comprehensive as possible in order to provide for maximum
recognition of lexical elements in input texts and for all necessary
translation equivalents;

b) to prevent ambiguous readings of phrases and
sentences by means of lexical information;

c) to permit the selection (on the basis of lexical
features) of the proper translation equivalent for a lexical item
from two or more translation equivalents;

d) to guarantee production of well-formed sentences
only.

The  first  of  these  points  is  obvious.  The  following  exam¬ 
ples  may  illustrate  point  b) : 


They sent the missile to the moon.


He monitored the flight to the moon.


For both examples (as for each surface sentence in which a
prepositional phrase immediately follows a noun phrase in
post-predicate position), there are two possible analyses. One analyzes
the adverbial as a post-nominal modifier (represented by an arrow
above the sentence in the examples given). The other analyzes it
as a verb modifier (arrows below the sentences). Given the
necessary syntactic or semo-syntactic information, such sentences
can often be disambiguated. In this instance, the relevant
distinction lies in whether the verb and noun may be modified by an
adverb of direction. (The correct analyses of the examples above
are indicated.)
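
The attachment decision can be sketched with a single lexical feature. The example below is an illustration only: the feature table is an assumption made for this sketch (that send and flight accept an adverb of direction, while monitor and missile do not), and the names are hypothetical rather than taken from the Center's dictionaries.

```python
# Assumed feature: does the item accept an adverb of direction?
TAKES_DIRECTION = {
    "send": True,
    "monitor": False,
    "missile": False,
    "flight": True,
}


def attachment_sites(verb, noun):
    """Return where a directional prepositional phrase may attach."""
    sites = []
    if TAKES_DIRECTION.get(noun, False):
        sites.append("noun")
    if TAKES_DIRECTION.get(verb, False):
        sites.append("verb")
    return sites
```

A result with exactly one site corresponds to a disambiguated sentence; two sites mean that the lexical information alone leaves the ambiguity unresolved.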

An example for point c), the need for selection of proper
translations based on lexical information, is the German verb
abfuettern. It has two possible English translations: feed if
the object is animate, line if the object is inanimate (more
precisely, articles of clothing). The choice between these two
translation equivalents can easily be made if the distinction
between animate and inanimate objects is made in the verb
dictionary, and if nouns are sub-classified accordingly.
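
A minimal sketch of such a selection, assuming that each translation equivalent is stored with the set of object types it requires; the table layout and the function name are hypothetical, and AN and IN stand for animate and inanimate as in the rule examples of Section I.

```python
# Each translation is paired with the object types it requires.
TRANSLATIONS = {
    "abfuettern": [({"AN"}, "feed"), ({"IN"}, "line")],
}


def translate_verb(verb, object_types):
    """Return the translations whose object restriction is satisfied."""
    return [
        english
        for required, english in TRANSLATIONS[verb]
        if required & set(object_types)
    ]
```

If the object noun is itself ambiguous between an animate and an inanimate reading, the sketch returns both translations, which parallels the treatment of unresolved ambiguity elsewhere in the report.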

Point d), the production of well-formed sentences, is clear:
English verbs require certain syntactic patterns and may not be
used in other patterns. For example, the German sentence

Sie erklaerten ihm das Problem.

must be translated as

They explained the problem to him.

but not

*They explained him the problem.

while the sentence

Wir gaben diesem Phaenomen einen neuen Namen.

may be translated as

We gave a new name to this phenomenon,
or We gave this phenomenon a new name.

The information coded in the various lexical lists will
later be added to the German and English machine-processable
dictionaries of the Center.
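
The restriction illustrated by explain and give can be sketched as a per-verb flag that licenses the double-object order. This is an illustration under stated assumptions: the flag table and the function are hypothetical, and past-tense forms are used as keys only to keep the example short.

```python
# Assumed lexical flag: may the indirect object precede the direct
# object without a preposition?
ALLOWS_DOUBLE_OBJECT = {"gave": True, "explained": False}


def realizations(subject, verb, direct, indirect):
    """Return the well-formed English orders for a verb with two objects."""
    sentences = [f"{subject} {verb} {direct} to {indirect}."]
    if ALLOWS_DOUBLE_OBJECT.get(verb, False):
        sentences.append(f"{subject} {verb} {indirect} {direct}.")
    return sentences
```

For example, `realizations("They", "explained", "the problem", "him")` yields only the prepositional order, whereas the call for gave yields both orders.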


2.2.2  Work  Done:  German 


2.2.2.1 The German-English Noun List

Compilation of the German noun list was continued. For each
German entry we coded English translation equivalents and any
relevant features indicating gender, number (for tantum nouns), case
government (including prepositions) for deverbative nouns, and
area of discourse or stylistic level (e.g., <TECH>, <MED>,
<PHYS>, etc.). This work progressed through the German noun
Exzess, reaching a total of approximately 20,000 German nouns.


2.2.2.2 Revision of the German Verb List

The German verb list was revised in its entirety. This
revision included the following:

a) correction of miscoded or mispunched key-words or
descriptors, and addition of missing entries;

b) addition of case information to all German prepositions
which may be used with either dative or accusative;

c) addition of the descriptors ZI (zu-I, i.e.,
marked infinitive) and DASS (that-clause) to those German verb
entries which take one or both of these verb complements;

d) introduction of the symbol + between verb complements
which may be used as double objects with the particular
verb.


2.2.2.3 Addition of the Translation Equivalents

The English translation equivalents given in Wildhagen for
German verbs were added to the revised German verb list. In the
process of this work, German verb entries were split into more
than one entry whenever different English translations for a
German verb could be associated with specific groups of German
features.

Examples:

VER- MESS VT                =  MEASURE (EO LAND)
VER- MESS VR                =  MEASURE INCORRECTLY
VER- MESS VR + ETW GEN ZI   =  DARE, VENTURE

as in:

Sie vermassen diese Gegend.
They measured this area.

Dabei hatten sie sich vermessen.
They have measured incorrectly in this case.

Sie vermessen sich, diese Vermutung als Tatsachen hinzustellen.
They dare to represent these assumptions as facts.

Additional  information  which  was  given  in  Wildhagen  for  the 
purpose  of  selecting  proper  translation  equivalents  was  coded 
with  each  English  translation  to  which  it  pertained.  This  type 
of  information  consists  of: 

a) the area of discourse in which a particular
translation would be used (given in the list in angled brackets, e.g.,
<PHYS>, <MED>, etc.); or,

b) selection restrictions in the form of particular
nouns given as sample subjects or objects of the German verb or
of its English translation. These were added to the translations
and were marked as English or German, subject or object, by two
preceding letters: ES (English subject), EO, GS, or GO.

In addition, some English translation equivalents in Wildhagen
are accompanied by syntactic or semo-syntactic information.
Such data was incorporated in the noun list in the form of four
descriptors:

AP    =  a person    (human object)
AP'S  =  a person's  (human possessive pronoun)
ATH   =  a thing     (inanimate object)
OJ    =  oneself     (reflexive object)

Finally, verb entries which are used in Wildhagen in a verb
phrase with the German verb lassen (let, have, as in have
someone do something) were marked in this bilingual verb list. This
information will be used in future studies of verb phrases of
this type.


2.2.3  Work  Done:  English 

At the beginning of this contract period, the English verb
list (EVL) consisted of 6,5^7 entries which had been copied from
The Advanced Learner's Dictionary of Current English by Hornby,
Gatenby, and Wakefield. This is, to our knowledge, the only
dictionary which indicates for verbs the object complement and
adverbial complement environment in which the verb may occur.

Apart from its value as a tool for linguistic analysis, the
EVL was created for two reasons: to guarantee the production of
well-formed English sentences, and to be able to associate with
a particular verb the syntactic pattern in which the verb can be
used with a given meaning.


2.2.3.1 Classification in the Hornby Dictionary

In addition to the classification indicated by the patterns
below, verbs are redundantly marked as transitive or intransitive
if this is applicable. Verbs which require a reflexive object
are marked as VR; modals and auxiliaries, as "anomalous finites".

VERB PATTERNS

P1.  Verb + Direct Object
He cut his finger.

P2.  Verb + (not) to + Infinitive
He intended to go.

P3.  Verb + Noun or Pronoun + (not) to + Infinitive
I told the servant to open the window.

P4.  Verb + Noun or Pronoun + (to be) + Complement
We proved him (to be) wrong.

P5.  Verb + Noun or Pronoun + Infinitive
They felt the house shake.

P6.  Verb + Noun or Pronoun + Present Participle
They left me standing outside.

P7.  Verb + Object + Adjective (object complement)
The sun keeps us warm.

P8.  Verb + Object + Noun (object complement)
They named their son Henry.

P9.  Verb + Object + Past Participle
She had a new dress made.

P10. Verb + Object + Adverbial Adjunct
Put it here.

P11. Verb + that-Clause
He explained that nothing could be done.

P12. Verb + Noun or Pronoun + that-Clause
We satisfied ourselves that the plan would work.

P13. Verb + Interrogative Adverb (except why) + to + Infinitive
He is learning how to swim.

P14. Verb + Noun or Pronoun + Interrogative Adverb (except why)
+ to + Infinitive
The patterns show you how to make sentences.

P15. Verb + Interrogative Adverb + Clause
I don't mind where we go.

P16. Verb + Noun or Pronoun + Interrogative Adverb + Clause
They asked us when we would be back.

P17. Verb + Gerund

Group A - replacing the gerund with an infinitive results
in a change of meaning.

We stopped talking.

We stopped to talk.

Group B - the gerund may be replaced by an infinitive
without a change of meaning.

He began talking.

He began to talk.

Group C - the gerund is equivalent to a passive infinitive.

That needs explaining.

That needs to be explained.

P18. Verb + Direct Object + Indirect Object

Group A - the indirect object is preceded by the
preposition to and may occur without a preposition before the
direct object.

Throw that ball to me.

Throw me that ball.

Group B - the indirect object is preceded by the
preposition for and may occur without a preposition before the
direct object.

Have you left any for your sister?

Have you left your sister any?

Group C - covers all direct object + indirect object
constructions other than those stated in Groups A and B.

I explained the difficulty to him.

P19. Verb + Indirect Object + Direct Object

Group A - are those verbs which can be used with the
preposition to in Pattern 18A.

He handed me the book.

He handed the book to me.

Group B - are those verbs which can be used with the
preposition for in Pattern 18B.

Buy me one.

Buy one for me.

Group C - are those verbs which are rarely or never used
in Pattern 18.

I struck him a heavy blow.

P20. Verb + (for) + Complement of duration, distance, price
or weight

The rain lasted (for) a whole week.

It cost ten dollars.

P21. Verb alone

These are intransitive verbs. Some verbs which are normally
used with an object may also be used in this pattern,
the object being understood.

Fire burns.

The moon rose.

P22. Verb + Predicative Complement
This is a boat.

P23. Verb + Adverbial Adjunct
We must turn back.

P24. Verb + Preposition + Object

The verb and preposition combine to form a new transitive
verb followed by an object which can be a noun, pronoun,
gerund, phrase or clause.

Look at the blackboard.

He called on me.

P25. Verb + to + Infinitive

Group A - the infinitive is one of purpose or aim.

I went to buy some books.

Group B - the infinitive indicates result or outcome.

How can I get to know her?

Group C - the infinitive is equivalent to a co-ordinate
clause.

He awoke to find the house on fire.

Group D - the infinitive is the main verb.

I chanced to meet him in the park.

Group E - the infinitive is used after finites of be for
a variety of meanings.

Nobody is to know.

This I was to learn later.

Group F - contains as the only member the verb going to:

He is going to walk home.

2.2.3.2 Frequency of Patterns in EVL, 1969

The entries in EVL were subjected to a glossary run. The
results are represented in Table I, which follows.

TABLE I: Frequency of Patterns in EVL

Pattern    Frequency in EVL

P1             4248
P2               80
P3              120
P4               59
P5               14
P6               17
P7               96
P8               24
P9                7
P10            1670
P10B              1
P11             181
P12              26
P13              42
P14               8
P15              61
P16              10
P17              14
P17A             55
P17B             16
P17C              3
P18             910
P18A             17
P18B             80
P18C              8
P19              76
P19A             16
P19B              9
P19C             10
P20              78
P21            2074
P22              45
P22D              1
P23            1372
P24            1121
P24A              1
P24B              1
P25             139
P25A              1
P25B              1
P29*            312

*P29 refers to verbs which were not
classified in the Hornby Dictionary.

2.2.3.3 Subsequent Work


The purpose of the work performed during the first year of
the contract period was to improve EVL by making the original
classification scheme more precise, and to add to it the same
semo-syntactic selection restrictions as those of the German verb
list.


Thus, two of the Hornby verb patterns, P10 and P23, were
redefined. Pattern 10, for which Hornby gives as examples

He brought his brother to see me.

They treat their sister as if she were only a servant.

was restricted to

Verb + Object + Movable Adverbial Particle

He took off his hat.

He took his hat off.

Similarly, Pattern 23 was defined as

Verb + Adverbial Particle

Get up.

Sit down.

The actual updating of EVL involved:

a) the addition of the adverbial particles with
which each verb in the new P10 and P23 could occur;

b) the addition of the preposition(s) which each
verb in P24 (Verb + Prepositional Object) required;

c) the subclassification of the verbs in the general
classes P17, P18, and P19 into the corresponding subclasses A,
B, and C shown above in 2.2.3.1;

d) the specific classification of all verbs which
had, as a stop-gap measure, been assembled under P29; and

e) the addition of the descriptors H, N, M, K, I
(for: human, non-human animate, non-animate, non-animate
concrete, non-animate abstract, respectively) to all patterns in
which a noun phrase object complement occurred.

This updating process resulted in a new EVL, which consists
of 10,431 entries. Comparison of frequency of descriptors in
the new and the original EVL is made in Table II, which follows.

TABLE II: Frequency of Patterns in New and Original EVL

Pattern    New EVL        Original

P1            4267          4248
P2              79            80
P3             122           120
P4              59            59
P5              14            14
P6              17            17
P7              92            96
P8              22            24
P9               7             7
P10           1269          1570
P10B             -             1
P11            179           181
P12             27            26
P13             41            42
P14              7             8
P15             70            61
P16             10            10
P17             15            14
P17A            80            55
P17B            15            16
P17C    (illegible)            3
P18              -           910
P18A            63            17
P18B            32            80
P18C          1743             8
P19              -            76
P19A            58            16
P19B            30             9
P19C            14            10
P20             76            78
P21           2166          2074
P22             42            45
P22D             -             1
P23            866          1372
P24           1778          1121
P24A             -             1
P24B             -             1
P25            139           139
P25A             1             1
P25B             1             1
P29              -           312


The complete list of entries in the new EVL, subdivided as
follows, is attached to the report, Normalization of Natural
Language for Information Retrieval by Lehmann and Stachowitz.

a) Verbs which are both transitive and intransitive
   1) consisting of more than one word
   2) consisting of one word only

b) Verbs which are transitive only
   1) consisting of more than one word
   2) consisting of one word only

c) Verbs which are intransitive only
   1) consisting of more than one word
   2) consisting of one word only

d) Verbs with prepositional object or double object.


2.2.3.4 New Classification

The experience gained during this year, especially through
the acquisition of English translation equivalents for German
entries, showed that the classification scheme set up so far was
not adequate. In order to improve disambiguation, all the
complement types with which a verb may occur must be listed with
their semo-syntactic information. Therefore a new classification
scheme was developed by the German group and is described in
Section 2.3, which follows.


2 . 3  Development  of  a  General  Classification  System 

2.3.1  Purpose 

As  described  earlier  in  this  report,  the  Center's  lexical 
lists  already  contain  a  certain  amount  of  syntactic  and  semo- 
syntactic  information.  A  general  system  of  lexical  features  was 
developed  which  will  be  used  to  add  to  our  established  German  and 
English  noun  and  verb  lists  the  information  necessary  for  analy¬ 
sis  and  translation  and  in  future  work  on  the  classification  of 
German  and  English  adjectives.  Work  on  the  establishment  of  the 
necessary  f-ature  system  for  adverbials  will  be  undertaken  in  the 
com i ng  months . 

In  genera),  two  types  of  information  are  included  in  our 
feature  system: 
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a)  the  properties  of  the  classified  lexical  Item; 
th’s  informatlin  is  shown  as  a  value  (or  combination  of  values) 
of  the  subscript  TY  (type); 

b)  the  properties  of  the  environment  of  the  lexical 
item;  for  this  purpose,  several  subscripts  and  possible  values 
are  used  as  described  below. 

Note that some semo-syntactic features occur as syntactic
features to facilitate encoding (cf. the subscript RL under
nouns, where nouns are given the feature "may take a when-
clause" rather than the feature "noun of time").

In general, we indicate features which represent surface
phenomena. If we find, upon inspection of the completed lists,
that certain features can be predicted from the occurrence of
others, they will be excluded from the dictionary and introduced
by means of redundancy rules.
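
The role of redundancy rules can be illustrated with a minimal Python sketch. It is not the Center's procedure: the rule table (for instance, that HU predicts AN) is an assumption invented for the example, and `expand` and `compress` are hypothetical helper names.

```python
# Hypothetical redundancy rules: the feature on the left predicts those on
# the right. The specific implications are assumptions for illustration only.
REDUNDANCY_RULES = {
    "HU": {"AN"},   # assumed: human implies animate
    "AL": {"AN"},   # assumed: animal implies animate
    "AN": {"PO"},   # assumed: animate implies physical object
}

def expand(features):
    """Add every feature that is predictable from the stored ones."""
    result = set(features)
    changed = True
    while changed:
        changed = False
        for f in list(result):
            implied = REDUNDANCY_RULES.get(f, set()) - result
            if implied:
                result |= implied
                changed = True
    return result

def compress(features):
    """Drop features that the remaining ones already predict."""
    features = set(features)
    predictable = set()
    for f in features:
        predictable |= expand({f}) - {f}
    return features - predictable
```

Under these assumed rules a dictionary entry would store only the result of `compress`, and analysis would recover the full set with `expand`.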


2.3.2  Verb  Features 

Each  English  or  German  verb  will  be  given  some  or  all  of 
the  following  subscripts.  Certain  of  these  are  necessary  for 
all  verb  entries:  these  are  underlined  in  the  list  below.  Others 
are  relevant  only  in  one  of  the  languages  we  are  dealing  with; 
these  are  marked  by  G  for  German  and  E  for  English. 

TY = type of verb (transitivity)

TS = semantic type of subject

FS = syntactic form of subject (this subscript is omitted
if the verb allows only a noun phrase as subject)

DS (G) = deep subject (indicated only if the deep subject does
not occur as a nominative in the surface sentence)

OB = syntactic form of object(s) or complement(s)

TO = semantic type of object

RA = required adverbials

OA = optional adverbials


2.3.2.1 Values for Type (TY)

VT = takes at least one object which is not a reflexive
pronoun

VTC = takes a cognate object only; we define a cognate object
as the true cognate and all nouns subsumed under that
term, as e.g., to dance a waltz or a rain dance.

VR = takes an object which must be reflexive

VT, VR = takes at least two objects, one of which must be
reflexive and one which is not reflexive

VI = intransitive

NP = the verb does not passivize; verbs marked VI or VR do
not need this descriptor.

NG (E) = the verb does not form the progressive.

2.3.2.2 Values for Type of Subject (TS)

The  values  which  may  be  associated  with  the  subscript  TS 
are  all  semantic  subcategories  of  nouns  (cf.  features  for  nouns 
below).  In  addition,  the  values 

E  =  entia  (any  type  of  noun) 

P  =  plural  noun  only 

may  be  used  to  describe  the  subject  a  verb  requires. 

2.3.2.3 Values for Form of Subject (FS)

NP = noun phrase

IT = it

TH = that-clause

MI = marked infinitive

FT (E) = for-to complement

GR (E) = gerund

ICL = interrogative clause

IMI (E) = interrogative adverb + marked infinitive

II (G) = interrogative adverb + unmarked infinitive

2.3.2.4 Values for Deep Subject (DS)

G = genitive

D = dative

A = accusative

2.3.2.5 Values for Object or Complement Syntax (OB)

G (G) = genitive

D (G) = dative

A (G) = accusative

O (E) = noun phrase (NP) as object

All prepositions, spelled out; German prepositions which may
govern the dative or accusative are marked by the numbers
1 (for accusative) or 2 (for dative), e.g., AN1, IN2, etc.

TH, MI, etc. as defined above for FS

CL = main (subjunctive) clause

PAPL = past participle

I = unmarked infinitive

BC = takes be + NP or ADJ

CM = takes optional be + NP or ADJ (e.g., think)

NC = takes NP complement without be

NA = takes NP or ADJ complement without be

AC = takes ADJ complement without be

2.3.2.6 Values for Type of Object (TO)

These values are all noun sub-categories (cf. noun features
below), plus the values

E = entia (any type of noun)

P = plural noun only

R = reflexive

RCC = reciprocal (e.g., aneinander geraten)


2.3.2.7 Values for Required Adverbials (RA)

PLC = place (locative or directional)

DIR = direction to

ORN = origin (direction from)

TIM = time (punctual or durational)

PNC = punctual

DUR = durational

MAN = manner

MSR = measure

AC = adjective complement (for sensory verbs, e.g.,
smell good)

2.3.2.8 Value for Optional Adverbials (OA)

The subscript OA is always associated with the same value:
DOR = direction or origin (adverb of directionality)

2.3.3 Adjective Features

Adjectives are given one or more of the following subscripts
(only MD is mandatory):

TY = type of adjective

FM = form of adjective

MD = modifies nouns of the specified type

RA = requires an adverb (e.g., wohnhaft)

OB = form of object

TO = semantic type of object

2.3.3.1 Values for Type of Adjective (TY)

MSR = measurable (e.g., wide or strong as in five inches
wide, seven men strong)

TM = the adjective may undergo "tough movement" (e.g.,
hard, easy)

2.3.3.2 Values for Form of Adjective

PRPL = the adjective is in form a present participle

PAPL = past participle

2.3.3.3 Values for Type of Noun Modified (MD)

All sub-categories of nouns (cf. noun features below)

TH = that-clause

PLU = plural, mass, or collective noun

2.3.3.4 Values for Required Adverbials (RA)

The possible values for the subscript RA are those given
for the subscript RA for verbs (cf. verb features above).

2.3.3.5 Values for Form of Object (OB)

G (G) = genitive

D (G) = dative

A (G) = accusative

All prepositions, spelled out; case government ambiguity in
German prepositions is avoided by coding 1 (accusative)
or 2 (dative) after the preposition.

2.3.3.6 Values for Type of Object (TO)

The values for TO are all sub-categories of nouns (cf. noun
features below), and E (any type of noun).


2.3.4 Noun Features

Nouns are semantically classified and in addition have
descriptors indicating the type of attributes which they may
take. The subscripts for nouns are:

TY = type of noun

SX = sex

OB = object (in case of deverbative nouns, as
dependence on)

TO = semantic type of object

TA = takes attribute

RL = relative adverb (for deverbative nouns)

DF = derived from

FM = form (for nominalized adjectives)

2.3.4.1 Values for Type (TY)

PO = physical object

AB = abstract

AN = animate

PL = plant

IN = inanimate

HU = human

AL = animal

NM = proper name

CO = collective (components may be counted; can be used
with the verb disperse; e.g., group, herd, government)

BP = body part

MS = mass (homogeneous; may occur without article in the
singular; e.g., milk, sand)

MA = machine (since they can perform some human activities)

QU = quantity (___ + of NP; e.g., group, glass, half,
as in a glass of milk)

CN = count (abstract countable nouns, e.g., idea)

UN = unit (ADV = QUANT + ___; e.g., mile, year, as in
five miles long, to wait two years)

These values may be used in combinations; e.g., the English
noun government has the features TY (HU, CO, AB), indicating
both human and collective. This value system may be represented
in tree form as shown:


2.3.4.2 Values for Sex (SX)

The  subscript  SX  has  two  possible  values:  MA  (male)  and 
FE  (female). 

2.3.4.3 Values for Object (OB)

The values for the subscript OB (if relevant) are all
prepositions, spelled out, and followed by the numbers 1 or 2 to
indicate case government when the German preposition occurs with
dative or accusative: IN1, etc.

2.3.4.4 Values for Type of Object (TO)

The possible values for the subscript TO are PO, AB, etc.,
as defined under TY above.

2.3.4.5 Values for Attributive (TA)

ZU = marked infinitive (e.g., attempt, as in the attempt to
do something)

CL = main clause, as in die Behauptung, dies sei die
Wahrheit

TH = that-clause (non-relative that-clauses; e.g., his claim
that this was so)

DIR = directional adverbial complement (e.g., a trip across
Europe)

2.3.4.6 Values for Relative Adverb (RL)

WO = where (e.g., the place where I saw you)

WOHIN = whereto (e.g., the town where you went)

WARUM = why (e.g., the reason why he did it)

OB = whether (e.g., the question whether this is so)

WIE = how (e.g., die Frage, wie dies geschehen sei)

ALS = when (e.g., the time when I lived there)

2.3.4.8 Values for Form (FM)

The subscript FM may be used with only one value: A
(adjective). For example, the German noun der (or die)
Abtrünnige (the renegade) is coded without inflectional ending
and with the marker FM (A):

ABTRUENNIG TY (HU) FM (A).
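
A coded entry of this shape can be read mechanically. The following Python sketch assumes, purely for illustration, that an entry is a word followed by any number of SUBSCRIPT (VALUES) groups with values separated by commas or spaces; `parse_entry` is a hypothetical helper, not one of the Center's programs.

```python
import re

def parse_entry(line):
    """Split 'WORD SUB (V1, V2) SUB (V) ...' into (word, {subscript: [values]})."""
    line = line.strip().rstrip(".")
    word, _, rest = line.partition(" ")
    features = {}
    # Each group is a subscript name followed by its parenthesized values.
    for sub, vals in re.findall(r"([A-Z]+)\s*\(([^)]*)\)", rest):
        features[sub] = [v for v in re.split(r"[,\s]+", vals.strip()) if v]
    return word, features

word, features = parse_entry("ABTRUENNIG TY (HU) FM (A).")
```

Here `features` maps TY to the list containing HU and FM to the list containing A.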
SECTION III

PROGRAMMING 


During  this  reporting  period  the  programming  effort  was  di¬ 
vided  into  three  areas:  grammar  conversion  programs,  systems  pro¬ 
grams,  and  supporting  programs. 


3.1 Grammar Conversion

In order to make use of the existing IBM 7040 grammars and
dictionaries it was necessary to convert them to a format
suitable to the CDC 6600. The Remote File Management System (RFMS),
which was being developed to facilitate management of very large
data bases, was chosen. This system of programs allows the user
to define a data base in tree format with no restriction on the
number of branches or levels. It is based on a completely
inverted file system, and the updating and retrieval features it
allows are based on set theoretical operations.


3.1.1 Remote File Management System (RFMS F1)

The first conversion was to what will be called RFMS F1.
This was simply an intermediate conversion designed to retain
the information that was used by the IBM 7040 programs. The RFMS
F1 Data Base definition is as follows:

1] LEVEL RULE NUMBER (NAME);

3] DEGREE (NAME);

4] LEFT SIDE TERM (TEXT);

6] RIGHT SIDE TERM (RG);

62] RIGHT SIDE SYMBOL (TEXT IN 6);

63] B OPERATOR (NAME IN 6);

64] S OPERATOR (NAME IN 6);

7] TYPE WEIGHT INFORMATION (RG);

71] TYPE (NAME IN 7);

72] WEIGHT (NAME IN 7);

The Data Base is constructed of rules whose entries each
have a component number (e.g., 3]), a name (e.g., DEGREE), and
a data type (e.g., (NAME)). (RG), "repeating group", allows the
following set of components to be repeated. In the above case,
each rule has only one left side but can have any number of
terms on the right side.
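
The shape of one such rule can be pictured with a small Python sketch. It only mirrors the component list above: the class names and the sample values ("000123", "NP", "DET", "N") are invented for illustration and do not come from the RFMS data base.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RightSideTerm:          # component 6] (repeating group)
    symbol: str               # 62] RIGHT SIDE SYMBOL
    b_operator: str = ""      # 63] B OPERATOR
    s_operator: str = ""      # 64] S OPERATOR

@dataclass
class TypeWeight:             # component 7] (repeating group)
    type: str                 # 71] TYPE
    weight: str               # 72] WEIGHT

@dataclass
class F1Rule:
    rule_number: str          # 1]
    degree: str               # 3]
    left_side: str            # 4] exactly one left-side term
    right_side: List[RightSideTerm] = field(default_factory=list)
    type_weights: List[TypeWeight] = field(default_factory=list)

# Invented sample rule: one left side, two right-side terms.
rule = F1Rule("000123", "2", "NP",
              right_side=[RightSideTerm("DET"), RightSideTerm("N")])
```

The repeating groups become lists, so a rule may carry any number of right-side terms while the left side stays a single field.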

Both the English (ENG) and German (GER) machine processable
dictionaries and their syntactic and normal-form grammars were
converted from IBM 7040 to RFMS F1. The ENG dictionary was made up
of RMD and WEBSTER, which were in different formats.


3.1.2  Remote  File  Management  System  (RFMS  F2) 

To  allow  the  writing  of  grammars  containing  rules  in  terms 
of  complex  symbols  composed  of  subscripts,  values,  operators, 
macro  statements,  dummy  statements,  and  choice  statements,  RFMS 
F2  was  designed  and  is  defined  as  follows: 

1] RULE NUMBER (NAME);

2] RULE TYPES (RG);

21] RULE TYPE (NAME IN 2);

3] DEGREE (NAME);

4] MACRO (RG);

42] M CATEGORY SYM (NAME IN 4);

43] M SUBSCRIPT (RG IN 4);

431] M OP 1 (NAME IN 43);

432] M OP 2 (NAME IN 43);

433] M LOCATOR (NAME IN 43);

434] M SUBSCRIPT SYM (NAME IN 43);

435] M VALUE (RG IN 43);

4351] M BINARY OP (NAME IN 435);

4352] M UNARY OP (NAME IN 435);

4353] M VALUE SYM (NAME IN 435);

436] M SLASH (NAME IN 43);

52] L CATEGORY SYM (NAME);

53] L SUBSCRIPT (RG);

531] L OP 1 (NAME IN 53);

532] L OP 2 (NAME IN 53);

533] L LOCATOR (NAME IN 53);

534] L SUBSCRIPT SYM (NAME IN 53);

535] L VALUE (RG IN 53);

5351] L BINARY OP (NAME IN 535);

5352] L UNARY OP (NAME IN 535);

5353] L VALUE SYM (NAME IN 535);

536] L SLASH (NAME IN 53);

54] L OP (RG);

541] L OP SYM (NAME IN 54);

542] L OP VALUE (NAME IN 54);

55] L CHOICE (RG);

551] L CHOICE NUMBER (NAME IN 55);

552] L CHOICE COMMAND (NAME IN 55);

553] L CHOICE VALUE 1 (NAME IN 55);

554] L CHOICE VALUE (RG IN 55);

5541] L CHOICE VALUE 2 (NAME IN 554);

6] R SIDE (RG);

61] R CATEGORY OP (NAME IN 6);

62] R CATEGORY SYM (NAME IN 6);

63] R SUBSCRIPT (RG IN 6);

631] R OP 1 (NAME IN 63);

632] R OP 2 (NAME IN 63);

633] R LOCATOR (NAME IN 63);

634] R SUBSCRIPT SYM (NAME IN 63);

635] R VALUE (RG IN 63);

6351] R BINARY OP (NAME IN 635);

6352] R UNARY OP (NAME IN 635);

6353] R VALUE SYM (NAME IN 635);

636] R SLASH (NAME IN 63);

64] R OP (RG IN 6);

641] R OP SYM (NAME IN 64);

642] R OP VALUE (NAME IN 64);

65] R CHOICE (RG IN 6);

651] R CHOICE NUMBER (NAME IN 65);

652] R CHOICE OP (NAME IN 65);

653] R CHOICE SUBSCRIPT SYM (NAME IN 65);

654] R CHOICE VALUE (RG IN 65);

6541] R CHOICE BINARY OP (NAME IN 654);

6542] R CHOICE UNARY OP (NAME IN 654);

6543] R CHOICE VALUE 2 (NAME IN 654);

7] DUMMY (RG);

72] D CATEGORY SYM (NAME IN 7);

73] D SUBSCRIPT (RG IN 7);

731] D OP 1 (NAME IN 73);

732] D OP 2 (NAME IN 73);

733] D LOCATOR (NAME IN 73);

734] D SUBSCRIPT SYM (NAME IN 73);

735] D VALUE (RG IN 73);

7351] D BINARY OP (NAME IN 735);

7352] D UNARY OP (NAME IN 735);

7353] D VALUE SYM (NAME IN 735);

736] D SLASH (NAME IN 73);

74] D OP (RG IN 7);

741] D OP SYM (NAME IN 74);

742] D OP VALUE (NAME IN 74);

8] TYPE WEIGHT PROBABILITY (RG);

81] TWP ASSOCIATION NUMBER (NAME IN 8);

82] TYPE (NAME IN 8);

83] WEIGHT (NAME IN 8);

84] PROBABILITY (NAME IN 8);

9] TRANSFER CROSS REFERENCE (RG);

91] TRANSFER ROLE NUMBER (NAME IN 9);

The German RFMS F1 dictionary was converted to the RFMS F2
format, and work was begun toward the conflation of the
incomplete English RMD and WEBSTER dictionaries and their ultimate
conversion to RFMS F2.

Work was also done toward the conversion of the normal-form
grammars to RFMS F2 format. As it was not possible to tell
whether the interlingual substitution symbols were constructed
of GER, ENG, or RUS (Russian) transfer names, it was necessary
to set up a complicated conversion procedure. This involved
classifying a greater part of the 160,000 interlingual
substitution symbols by hand.

When the normal-form grammars are converted, all such
symbols will be reduced to their English part, and duplicate rules
will be eliminated, resulting in much smaller normal-form
grammars.


3.2 Systems Programs

The  following  systems  programs  were  designed  for  the  dic¬ 
tionary  phase  of  the  translation  system: 

a)  grammar  sort  (DICT  GS) 

b) tree construction (DICT TC)

c)  analysis  (DICT  A) 

d)  text  display  for  DICT  A  (MATRIX) 

e)  choice  (DICT  C) 

f)  workspace  display  for  DICT  A  and  DICT  C. 

A  subscript  grammar  program  (SUB  GRM)  was  designed  for  con¬ 
version  of  linguistic  coding  format  into  the  full  RFMS  F2  format. 

These programs are described below in 3.4.

3.3 Supporting Programs

Supporting  programs  were  designed  to: 

a) update the working lexical lists (LIST UP), cf. 3.4;

b) produce new concordances (REQ CON), cf. 3.4;

c) collect statistical data;

d) automate time-consuming linguistic operations;

e) convert working lexical lists into an intermediate format
for subsequent conversion into subscript format;

f) recognize poly-word entries in dictionary rules;

g) selectively display dictionary rules according to type or
class name;

h) generate allomorphs for the German verb list, producing
30,000 entries from an original 17,000;

i) add class names occurring in the form prefix-stem to entries
in the German dictionary;

j) convert the German noun list to an intermediate format more
amenable to updating and conversion to subscript format,
i.e., each specific kind of information is assigned a
specific line number.

The  old  grammar  display  program  was  expanded  to  include: 

a)  an  analysis  sort  which  sorts  terms  right-to-left,  and 

b)  a  dictionary  sort  with  the  constituents  of  the  right-side 

terms  concatenated. 


3.4 Program Descriptions

3.4.1 Dictionary Analysis (DICT A)

Using the compiled dictionary tree constructed by DICT TC,
the dictionary analysis program (DICT A) analyzes text and
generates a workspace to be used by the dictionary choice (DICT C)
program. The compiled tree is initially loaded onto the disk in
random format and the maximum number of blocks possible is kept
in memory at all times during analysis. Statistics are kept
concerning the use of each block in memory. If a new block must be
added, the previously loaded block with the least number of
accesses is discarded.
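
The replacement policy described here, discarding the resident block with the fewest accesses, can be sketched in Python as follows. The class and the `load_block` callable are hypothetical; how DICT A counted accesses or broke ties is not stated in this report, so those details are assumptions.

```python
class BlockCache:
    """Keep at most `capacity` tree blocks in memory; evict the least-accessed."""

    def __init__(self, capacity, load_block):
        self.capacity = capacity
        self.load_block = load_block   # hypothetical: block number -> block data
        self.blocks = {}               # block number -> data held in memory
        self.accesses = {}             # block number -> access count

    def get(self, number):
        if number not in self.blocks:
            if len(self.blocks) >= self.capacity:
                # Discard the resident block with the fewest accesses.
                victim = min(self.blocks, key=lambda n: self.accesses[n])
                del self.blocks[victim]
                del self.accesses[victim]
            self.blocks[number] = self.load_block(number)
            self.accesses[number] = 0
        self.accesses[number] += 1
        return self.blocks[number]
```

With a capacity of two, reading block 1 twice, then block 2, then block 3 discards block 2, since it has been accessed less often than block 1.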

DICT A has two input parameters, the K-option indicator and
the display indicator. If the K-option is on, the rules
interpreting endings are applied everywhere except after a
punctuation mark or a space. If the K-option is off, these rules
are applied only after a morpheme boundary. The display indicator
selects the sort option for the display of the resulting workspace.
These options are from-to sorts, to-from sorts, or both.

For each file entry, the display contains the rule which
applied and its number, the items "FROM" (text position where the
entry begins), "TO" (text position where the entry ends), and a
condition code for the application of rules interpreting the
immediate right context.

The analysis program creates a table containing entries of
text character sequences which match the compiled tree. Each
table entry contains three items of information concerning the
sequence: the location of the node in the tree, the starting
character (or file), and the number of characters at this point.

The text consists of N characters (numbered from 1 to N).
For each character position I, an associated file I is created
which contains entries whose terminal strings end at position I.
Entries for file I are referred to as FEI.1, FEI.2, etc.
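
The organization of the workspace into one file per ending position can be illustrated with a short Python sketch. It is a simplification: it scans a plain word list at every starting position instead of walking the compiled tree, and the sample text and word list are invented.

```python
from collections import defaultdict

def build_files(text, dictionary):
    """Return {I: [(FROM, TO, string), ...]}, with positions numbered from 1."""
    files = defaultdict(list)
    for start in range(len(text)):
        for word in dictionary:
            if text.startswith(word, start):
                frm = start + 1            # 1-based position of first character
                to = start + len(word)     # 1-based position of last character
                files[to].append((frm, to, word))
    return dict(files)

# Invented sample: file 4 receives two entries, both ending at position 4.
files = build_files("UNDOING", ["UN", "DO", "ING", "UNDO"])
```

In this sample, file 2 holds UN (FROM 1, TO 2), file 4 holds UNDO and DO, and file 7 holds ING.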

Every sequence of characters defining a terminal in the tree
has as its second character a B, E, or blank represented by °
(see DICT GS). Thus at this point the node will be either a B, an
E, or a °; if more than one of these characters occurs at this
point in the tree, they will be linked together by down pointers
indicating branches in the tree.

For  each  text  character  processed,  a  new  table  entry  is  con¬ 
structed  if  that  character  may  begin  a  sequence. 

Each table entry already constructed is processed as follows:

a) a new file entry is constructed if a sequence ended in
the last file;

b) the table entry is updated by the new node position and
the character count is either destroyed or incremented
according to whether:

(1) the sequence does or does not continue as part of
another rule, and

(2) the second character of the sequence is or is not
being processed;

c) the starting branch conditions are evaluated (as opposed
to character matches being performed as in the cases
above), if:

(1) a sequence continues as part of another rule, and

(2) the second character of the sequence is being
processed.

If the second character of the sequence being processed is:

B, the string may not begin if the previous file

(1) does not contain an interpreted string,

(2) contains a punctuation mark, or

(3) contains a blank;

E, the string may begin;

°, the string may not begin if the previous file does not contain
an interpreted string.
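
These three conditions amount to a small decision function. The Python sketch below restates them directly; the function name and the boolean arguments describing the previous file are hypothetical.

```python
def may_begin(indicator, prev_interpreted, prev_punct=False, prev_blank=False):
    """Decide whether a string may begin, given its second-character indicator.

    prev_interpreted: the previous file contains an interpreted string
    prev_punct, prev_blank: the previous file contains a punctuation mark / blank
    """
    if indicator == "E":
        return True                      # no restriction
    if indicator == "°":
        return prev_interpreted          # needs a preceding interpreted string
    if indicator == "B":
        # Needs a preceding interpreted string that is not punctuation or blank.
        return prev_interpreted and not (prev_punct or prev_blank)
    raise ValueError(f"unknown indicator: {indicator!r}")
```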

During the processing of the second character, the first
reference to the table entry modifies the entry. All future
references create new table entries. After all table entries for
the character are processed, the table is resorted to put the
longest sequence first, if and only if there were any multiple
second-character table-entry constructions.

As each new file entry is constructed, the left-side
operators M, and ° are used to compute the value for the FROM file.
This value will be used by the following file to determine
whether the new file may be constructed. If the second character
is a P and the value of the previous file indicates a blank or
punctuation mark, a new file entry is completed.

3.4.2 Dictionary Choice (DICT C)

DICT C processes the from-to workspace output from DICT A.
It discards all file entries from the workspace which do not
belong to a sequence of rules which span M-symbols. (The M-symbols
are, primarily, blanks or punctuation marks, including hyphens.)
It also generates K-rules for all M-symbol sequences which are
not spanned. It has four input parameters. The first sets the
K-option either on or off (cf. 3.4.1). The second sets the
preference (P-) option either on or off. The third records which
workspace display is requested for output, the options being any
choice or combination of: to-from, from-to, or all deleted file
entries. The fourth parameter indicates whether the from-to
workspace should be saved or destroyed. (Word analysis uses
workspace in the to-from format.)

DICT C reads in file entries until it finds a group which
completely spans two M-symbols. It processes this group and then
reads in the next group.

The  first  operation  performed  is  the  elimination  of  all  file 
entries  from  this  group  which  have  right-side  F-operators  and  are 
not  followed  by  an  M-symbol.  An  F-operator  is  assigned  to  all 
rules  for  which  only  punctuation  or  °  can  follow. 

If the P-option is on, then for any file entry having a
left-side P-operator, all other sequences or file entries covering
the same span are discarded. The P-operator in a rule gives
preference to a long span over two or more short spans.

The rules used in all possible sequences covering the span
are tagged for later processing. Processing for this span is
terminated when a possible sequence is found without M-symbols
resulting from a rule with a multi-word right-side. If a possible
sequence with an internal M-symbol is found, all possible
sequences are calculated for each subspan. If a subspan is not
completely covered, a K-rule is generated. When the K-option is
on, additional K-rules are constructed which link together all
possibilities for prefixes and suffixes.

If the original span was not covered, a K-rule is generated
to cover it. Additional K-rules are also generated for sequences
of the form: prefix-K, prefix-K-suffix, and K-suffix; and each of
these sequences covers the original span.
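
The core of this step, finding every sequence of file entries that covers a span and falling back to a K-rule when none exists, can be sketched in Python. The sketch is a simplified illustration: entries are plain (FROM, TO, rule) tuples, the P-option and the prefix and suffix K-rules are left out, and both function names are hypothetical.

```python
def covering_sequences(entries, start, end):
    """All chains of (from, to, rule) entries that tile positions start..end."""
    by_from = {}
    for entry in entries:
        by_from.setdefault(entry[0], []).append(entry)

    def extend(pos):
        if pos > end:
            return [[]]                      # span fully covered
        chains = []
        for entry in by_from.get(pos, []):
            if entry[1] <= end:
                for tail in extend(entry[1] + 1):
                    chains.append([entry] + tail)
        return chains

    return extend(start)

def choose(entries, start, end):
    """Return the covering sequences, or a single K-rule if the span is uncovered."""
    chains = covering_sequences(entries, start, end)
    if not chains:
        return [[(start, end, "K")]]
    return chains
```

Given entries for UN (1 to 2), UNDO (1 to 4), DO (3 to 4) and ING (5 to 7), the span 1 to 7 has two covering sequences; with only the UN entry available, the span is uncovered and a single K-rule from 1 to 7 results.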


3.4.3  Dictionary  Grammar  Sort  (DICT  GS) 

DICT GS has two major functions: a) to determine the
restrictions on the application of a particular rule, and b) to
sort the dictionary grammar according to the right-side term(s)
and the application restriction information (ARI).

The dictionary contains sixty-four roots. From each root
four branches may theoretically extend which represent the
restrictions for all terminals. These branches are the
[P]-restriction, the [°]-restriction, the [B]-restriction, and the
[E]-restriction. The [P]-restriction indicates that the rule may
apply to a string which is preceded by a punctuation mark or blank.
Both the [°]-restriction and the [B]-restriction indicate that
the rule may apply to a string which is contiguous to a preceding
interpreted string. The [B]-restriction also indicates that the
span must not be preceded by a punctuation mark or blank. The
[E]-restriction indicates that the rule may apply anywhere; there
are no restrictions in this case.

To construct a grammar tree, the ARI of the rules needs to
be retained. Therefore, depending upon the ARI in the rule, the
program DICT GS inserts a "B", "E", or "°". (The [P]-restriction
is included under the °-indicator at this point. In the surface
dictionary analysis, a distinction is made.)

DICT  GS  strips  RFMS  loader-format  repeating-group  names  and 
extraneous  information  from  the  rule.  The  program  generates  sort 
keys,  consisting  of  the  right-side  terms  and  the  ARI,  and  retains 
the  left-side  terms  and  ARI  as  data.  The  ARI  indicator  is  the 
second  character  in  the  sort  key. 

Each rule in the dictionary grammar is converted to the
following form in DICT GS:

Word 1: Length of rule (revised for SORT/MERGE
routine), M, and length of sort key, N;

Words 2 → (N+1): Sort key;

Words (N+2) → M: Sort data area.

The  program  then  sorts  the  rules  in  the  dictionary  grammar, 
which  are  in  the  form  listed  above. 
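
The construction of sort keys with the ARI indicator as second character can be sketched in Python. This is an illustration under stated assumptions: the sample rules are invented, the data tuple layout is hypothetical, and ordinary string comparison stands in for the SORT/MERGE routine, whose collating sequence is not given here.

```python
def make_record(right_side, ari, left_side, rule_number):
    """Build (sort_key, data); the ARI indicator is the key's second character."""
    # The [P]-restriction is filed under the ° indicator at this stage;
    # the original ARI is kept in the data area so it can be told apart later.
    indicator = "°" if ari == "P" else ari
    key = right_side[0] + indicator + right_side[1:]
    return key, (left_side, ari, rule_number)

# Invented sample rules: (right side, ARI, left side, rule number).
rules = [("ING", "B", "SUFFIX", 3), ("UN", "P", "PREFIX", 1), ("DO", "E", "VERB", 2)]
records = sorted(make_record(*r) for r in rules)
```

The three sample rules receive the keys IBNG, U°N and DEO and are sorted on those keys.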

Finally,  DICT  GS  creates  a  new  tape  consisting  of  two  re¬ 
cords.  The  first  record  contains  information  concerning  the 
length  of  the  longest  sort  key  created,  the  length  of  the  longest 
data  area  created,  and  the  date  the  new  tape  was  created.  The 
second  record  contains  the  sorted  dictionary  grammar.  This  new 
file  is  used  as  input  to  the  dictionary  tree  construction  program. 

3.4.4 Dictionary Tree Construction (DICT TC)

DICT TC builds the compiled dictionary tree and its index
from the output of DICT GS. It reads in one entry at a time,
comparing it character by character with the previous entry.
Where the character strings differ, a down pointer is attached to
the previous string to indicate the place where the new string
continues. If the old string is a subset of the new string, a
continuation (or right) pointer is attached to the end of the old
string. In both cases, after all the characters are placed in
the tree, the remaining information (e.g., the rule number and
the left-side of the rule) is added at the end of the string.
Another new entry is read in and the process is repeated. Each
time a new first character is encountered, a pointer is placed in
the index table. Thus, after the process is completed, there is
a pointer to the beginning of every character tree. The index
and the compiled tree are then written out in a form suitable for
use by DICT A.
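
A tree with right (continuation) and down (alternative) pointers and a first-character index can be sketched in Python as follows. The sketch is a simplification: it walks from the index for every entry rather than comparing only against the previous entry, and the `Node` class, function names, and sample keys are hypothetical.

```python
class Node:
    def __init__(self, char):
        self.char = char
        self.right = None    # continuation: next character of the same string
        self.down = None     # alternative character at the same position
        self.data = None     # rule information stored at the end of a string

def build_tree(sorted_entries):
    """sorted_entries: (key, data) pairs in sort order. Returns the index table."""
    index = {}               # first character -> start of its character tree
    for key, data in sorted_entries:
        node = index.get(key[0])
        if node is None:
            node = index[key[0]] = Node(key[0])
        for char in key[1:]:
            if node.right is None:
                node.right = Node(char)
                node = node.right
                continue
            node = node.right
            while node.char != char:         # follow or extend the down chain
                if node.down is None:
                    node.down = Node(char)
                node = node.down
        node.data = data
    return index

def lookup(index, key):
    """Return the data stored for key, or None if the key is not in the tree."""
    node = index.get(key[0])
    for char in key[1:]:
        node = node.right if node else None
        while node is not None and node.char != char:
            node = node.down
    return node.data if node else None
```

For the invented keys UEN, UENDO and UENTIL, the second key continues the first through a right pointer, and the third branches from the second through a down pointer at the character following N.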

3.4.5  Subscript  Grammar  (SUB  GRM) 

SUB  GRM  converts  subscript  rules  from  the  form  in  which  the 
linguists  encode  them  into  RFMS  F2  Loader  Input  format.  Rule 
numbers  and  duplication  numbers  are  optional  input.  All  rules 
containing  format  errors  are  discarded. 


3.4.6  List  Update  (LIST  UP) 

LIST UP updates all the working lexical lists. These are in
the form of card images, each of which is indexed by corpus,
request, and line numbers. LIST UP allows additions, deletions,
insertions, and replacements on a card-for-card basis. The
output consists of all requests plus all changes, or only those
requests for which a change was made, and a new updated tape.
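
Card-for-card updating keyed by corpus, request, and line number can be sketched in Python. The operation names, the tuple layout of a change, and the sample cards are assumptions made for the example; the actual LIST UP control-card format is not described in this report.

```python
def list_up(cards, changes):
    """Apply changes to card images keyed by (corpus, request, line).

    cards:   {(corpus, request, line): text}
    changes: list of (operation, key, text) tuples
    Returns the updated cards in key order and the set of changed requests.
    """
    updated = dict(cards)
    touched = set()
    for op, key, text in changes:
        if op == "delete":
            updated.pop(key, None)
        elif op in ("add", "insert", "replace"):
            updated[key] = text
        else:
            raise ValueError(f"unknown operation: {op}")
        touched.add(key[:2])              # (corpus, request) that was changed
    return dict(sorted(updated.items())), touched
```

The second return value supports the option of listing only those requests for which a change was made.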


3.4.7  Concordance  Program  (REQ  CON) 

A  new  concordance  program  was  constructed  having  the  follow¬ 
ing  features: 

a)  A  display  of  the  concorded  word  in  the  context  of 
the  entire  request  (identified  by  the  digits  in  columns  4-7) 

b)  Forward  and/or  backward  sorts.  Each  sort  includes 
all  the  words  in  the  request.  In  the  forward  sort  the  re¬ 
quest..  in  the  following  succession,  is  used  to  determine  the 
list  order  for  the  concorded  word — 

1)  concorded  word 

2)  words  in  sequence  to  the  right  of  the  concor¬ 

ded  word 
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3)  a  "zero  word",  inserted  at  the  end  of  the  re¬ 
quest,  which  takes  precedence  over  any  other 
word  at  the  same  point 

k)  words  in  sequence  to  the  ieft  of  the  concorded 
word . 

A  backward  sort  takes  the  words  to  the  left  first,  and  then 
the  words  to  the  right,  inserting  the  "zero  word"  at  the 
beginning  of  the  request. 

c)  An  inclusion/exclusion  option 

d)  A  glossary  of  all  concorded  words  with  their 
f requenc i es 

e)  A  choice  of  no  display  or  any  of  three  forms  of 
output  display—  all  the  requests,  only  those  requests  which 
were  used,  or,  the  requests  not  used 

f)  Standard  or  non-standard  procedure  for  ccr.cording 
words.  Standard  is  based  on  the  occurrence  of  the  word 
itself;  non-standard  refers  to  words  preceded  by  a  special 
character.  For  the  latter,  pre-processing  programs  for 
tagging  the  words  to  be  concorded  may  be  required. 

g)  Concordance restrictable to specified sequences of
starting characters. This capability permits the recovery
of information when the capacity of the computer is exceeded.
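
The sort order of feature b), the glossary of feature d), and the restriction of feature g) can be sketched in Python as follows. This is an illustration under stated assumptions, not the REQ CON program. The names ZERO, sort_key, and concord are hypothetical. The sketch assumes that the left context in the forward sort is read in text order from the start of the request, and that the backward sort is its mirror image, reading leftward from the concorded word. The "zero word" is modeled as a token that sorts ahead of every real word.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Token = Tuple[int, str]
ZERO: Token = (0, "")   # the "zero word": sorts ahead of every real word


def _tok(word: str) -> Token:
    return (1, word.lower())


def sort_key(words: Sequence[str], i: int, backward: bool = False) -> List[Token]:
    """Sort key for the word at position i of a request."""
    left = [_tok(w) for w in words[:i]]
    right = [_tok(w) for w in words[i + 1:]]
    if backward:
        # Read leftward; the zero word marks the beginning of the request.
        return [_tok(words[i])] + left[::-1] + [ZERO] + right[::-1]
    # Read rightward; the zero word marks the end of the request.
    return [_tok(words[i])] + right + [ZERO] + left


def concord(requests: Dict[str, str], backward: bool = False,
            prefixes: Sequence[str] = ()) -> Tuple[List[Tuple[str, str, str]], Counter]:
    """Return sorted (word, request id, request text) lines and a glossary."""
    entries = []
    for req_id, text in requests.items():
        words = text.split()
        for i, word in enumerate(words):
            # Feature g): restrict to words with the given starting characters.
            if prefixes and not word.lower().startswith(tuple(prefixes)):
                continue
            entries.append((sort_key(words, i, backward), req_id, i,
                            word.lower(), text))
    entries.sort()
    lines = [(word, req_id, text) for _, req_id, _, word, text in entries]
    glossary = Counter(word for word, _, _ in lines)   # feature d)
    return lines, glossary
```

Because the zero word precedes every real word, a request that ends immediately after the concorded word is listed before one in which further words follow.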




CONCLUSION 


Progress under the contract has been good, in spite of reduced
funding. The theory underlying the Linguistics Research
System has been developed. The linguistic descriptions which
are necessary to implement this theory, however, have not met
our original projections, because of lack of manpower. Programming
has also suffered from the reduction in funding.


During  the  remainder  of  the  contract  period  a  lexicon  will 
be  produced  which  will  have  "precise  information  on  the  syntactic 
and  semantic  properties  of  lexical  items".  Preliminary  grammars
as  required  for  the  implementation  of  the  Linguistics  Research 
System  will  be  produced. 


Much of the programming effort has been concerned with
bringing our linguistic data into the formats required by the
Linguistics Research System, and with updating the Center's lexical
data bases. In the last two years of work under the contract,
programs will be constructed for handling the grammars described
in Section I of this report and the German and English lexical
data.




